INTERSPEECH.2013 - Analysis and Assessment

Total: 54

#1 Using phonetic feature extraction to determine optimal speech regions for maximising the effectiveness of glottal source analysis

Authors: John Kane ; Irena Yanushevskaya ; John Dalton ; Christer Gobl ; Ailbhe Ní Chasaide

Parameterisation of the glottal source has become increasingly useful for speech technology. For many applications it may be desirable to restrict the glottal source feature data to only those speech regions where it can be reliably extracted. In this paper we exploit a previously proposed set of binary phonetic feature extractors to help determine optimal regions for glottal source analysis. Besides validating the phonetic feature extractors, we also quantitatively assess their usefulness for improving voice quality classification and find highly significant reductions in error rates, in particular when nasal and fricative regions are excluded.

#2 Beyond bandlimited sampling of speech spectral envelope imposed by the harmonic structure of voiced sounds

Authors: Hideki Kawahara ; Masanori Morise ; Tomoki Toda ; Ryuichi Nisimura ; Toshio Irino

A new spectral envelope estimation procedure is proposed to recover details beyond the band limitation imposed by Shannon's sampling theory when the periodic excitation of voiced sounds is interpreted as a sampling operation in the frequency domain. The proposed procedure is a hybrid of STRAIGHT, an F0-adaptive spectral envelope estimator, and autoregressive model parameter estimation. Wavelet analyses of these spectral models in the frequency domain enabled objective evaluation of this recovery procedure. The proposed procedure provides better speech quality, especially when parameter manipulation is introduced.

#3 A source-filter based adaptive harmonic model and its application to speech prosody modification

Authors: JeeSok Lee ; Frank K. Soong ; Hong-Goo Kang

This paper presents a source-filter based adaptive harmonic model (aHM) that can modify the prosody of given speech signals. Although the conventional aHM generates a homogeneous replication of the input speech, it is not suitable for prosody modification since temporal and spectral information are interwoven. The proposed method overcomes this limitation by further decomposing the harmonic parameters extracted by the aHM into source- and filter-related components. By applying a source-filter structure to the aHM, the proposed algorithm can modify the pitch of the synthesized speech while introducing only minor degradation. Both objective and subjective test results show that the proposed algorithm can naturally manipulate the pitch contour, performing much better than conventional algorithms such as pitch-synchronous overlap-add (PSOLA) and speech transformation and representation using adaptive interpolation of weighted spectrum (STRAIGHT).
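
A rough picture of the envelope-resampling idea behind such source-filter pitch modification: interpolate an envelope through the harmonic amplitudes measured at multiples of the analysis F0, then re-sample it at harmonics of the target F0. The minimal sketch below is a toy version of that idea only; the function name, the linear log-domain interpolation, and all parameter values are assumptions, not the paper's aHM decomposition.

```python
import numpy as np

def resample_harmonic_amplitudes(amps, f0, new_f0, fs):
    """Toy envelope-preserving pitch modification (hypothetical helper):
    interpolate the log-amplitude envelope sampled at multiples of f0,
    then re-sample it at harmonics of new_f0."""
    harm_freqs = f0 * np.arange(1, len(amps) + 1)       # analysis harmonics
    log_env = np.log(np.maximum(amps, 1e-12))
    n_new = int((fs / 2) // new_f0)                     # harmonics below Nyquist
    new_freqs = new_f0 * np.arange(1, n_new + 1)
    return new_freqs, np.exp(np.interp(new_freqs, harm_freqs, log_env))

# Example: move a 100 Hz voice to 130 Hz while keeping the envelope shape.
amps = np.exp(-np.arange(1, 41) / 10.0)   # synthetic decaying amplitudes
freqs, new_amps = resample_harmonic_amplitudes(amps, 100.0, 130.0, 16000)
```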

#4 Detection of glottal opening instants using Hilbert envelope

Authors: K. Ramesh ; S. R. M. Prasanna ; D. Govind

The objective of this work is to develop an automatic method for estimating glottal opening instants (GOIs) using the Hilbert envelope (HE). The GOIs are the secondary major excitations, after the glottal closure instants (GCIs), during the production of voiced speech. The HE is defined as the magnitude of the complex time function (CTF) of a given signal. The unipolar property of the HE is exploited to pick the second-largest peak in each glottal cycle, which is hypothesized to be the GOI. The electroglottogram (EGG) / speech signal is first passed through the zero frequency filtering (ZFF) method to extract GCIs. With the help of the detected GCIs, the secondary peaks present in the HE of the dEGG / residual are hypothesized as GOIs. The hypothesized GOIs are compared with secondary peaks estimated directly from the dEGG / residual, and the GOIs hypothesized by the proposed method show less variance than direct peak picking from the dEGG / residual.
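
The HE itself is simple to compute as the magnitude of the analytic signal. A minimal sketch (the test signal and sampling rate are illustrative):

```python
import numpy as np
from scipy.signal import hilbert

def hilbert_envelope(x):
    """Hilbert envelope: magnitude of the analytic (complex time) signal.
    Being unipolar, its peaks can be picked irrespective of the polarity
    of the underlying dEGG / residual waveform."""
    return np.abs(hilbert(x))

fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 10 * t) * np.sin(2 * np.pi * 800 * t)
env = hilbert_envelope(x)   # tracks the 10 Hz modulation of the carrier
```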

#5 Robust formant detection using group delay function and stabilized weighted linear prediction

Authors: Dhananjaya Gowda ; Jouni Pohjalainen ; Mikko Kurimo ; Paavo Alku

In this paper, we propose a robust spectral representation for detecting formants in heavily degraded conditions. The method combines the temporal robustness of stabilized weighted linear prediction (SWLP) with the robustness of the group delay (GD) function in the frequency domain. Weighting the cost function in linear prediction analysis with the short-time energy of the speech signal improves the robustness of the resultant spectrum. It also improves the accuracy of the estimated resonances, as the weighting function gives more weight to the closed phase of the glottal cycle, which is also the high-SNR region of the signal. The group delay spectrum, computed as the sum of the individual resonances given by the roots of the SWLP polynomial, improves the robustness of the weaker higher-order resonances. The proposed SWLP-GD spectrum performs better than the conventional LP spectrum and the STRAIGHT spectrum in terms of spectral distortion and formant detection accuracy.
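
The temporal weighting amounts to solving energy-weighted LP normal equations. The sketch below implements plain weighted LP; the model order, the energy window length, and the regularization constant are assumptions, and the stabilization step that distinguishes SWLP from WLP is omitted.

```python
import numpy as np

def wlp(x, p=10, M=16):
    """Weighted LP: minimize the short-time-energy-weighted squared
    prediction error. Minimal sketch; omits SWLP's stabilization step."""
    n = np.arange(p, len(x))
    # STE weight: energy of the M samples preceding each instant
    w = np.array([np.sum(x[max(0, i - M):i] ** 2) for i in n]) + 1e-9
    X = np.stack([x[n - k] for k in range(1, p + 1)], axis=1)  # past samples
    C = X.T @ (w[:, None] * X)        # weighted covariance matrix
    c = X.T @ (w * x[n])
    return np.linalg.solve(C, c)      # predictor: A(z) = 1 - sum a_k z^-k
```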

#6 A source-filter separation algorithm for voiced sounds based on an exact anticausal/causal pole decomposition for the class of periodic signals

Authors: Thomas Hézard ; Thomas Hélie ; Boris Doval

This paper addresses the source-filter separation problem in the context of a causal/anticausal linear filter model of voice production. An algorithm based on standard signal processing tools is proposed for the class of quasi-periodic signals (voiced sounds with quasi-stationary pitch). First, a one-period frame of an equivalent stationary, infinitely periodic signal is built, with particular attention paid to the problems of windowing and temporal aliasing. Second, an exact pole decomposition of this signal is computed within the class of T0-periodic signals. Finally, the glottal closure instant (GCI) and the causal-anticausal factorization of the initial frame are jointly estimated from this decomposition. The performance of the algorithm is demonstrated on synthetic signals and discussed for real speech. In conclusion, the application of this new algorithm in a complete voice analysis-synthesis system is discussed.

#7 Assessing the intelligibility impact of vowel space expansion via clear speech-inspired frequency warping

Authors: Elizabeth Godoy ; M. Koutsogiannaki ; Yannis Stylianou

Among the key acoustic features credited with the intelligibility gain of Clear speech are the observed reduction in speaking rate and the expansion of vowel space, which reflect greater articulation and vowel discrimination. Addressing the slower speaking rate, previous works have attempted to assess the intelligibility impact of time-scaling casual speech to mimic Clear speech. In a complementary fashion, this work addresses the second of these key traits, namely vowel space expansion. Specifically, a novel Clear speech-inspired frequency warping method is described and shown to successfully achieve vowel space expansion when applied to casual speech. The intelligibility impact of this expansion is then evaluated objectively and subjectively through formal listening tests. Much like the related time-scaling works, the frequency warping that expands vowel space is not shown to yield intelligibility gains. The implication is that further analyses and studies are needed in order to isolate the pertinent acoustic-phonetic cues that lead to the improved intelligibility of Clear speech.

#8 Prediction of intelligibility of noisy and time-frequency weighted speech based on mutual information between amplitude envelopes

Authors: Jesper Jensen ; Cees H. Taal

This paper deals with the problem of predicting the average intelligibility of noisy and potentially processed speech signals, as observed by a group of normal-hearing listeners. We propose a prediction model based on the hypothesis that intelligibility is monotonically related to the amount of Shannon information that the critical-band amplitude envelopes of the noisy/processed signal convey about the corresponding clean signal envelopes. The resulting intelligibility predictor turns out to be a simple function of the correlation between noisy/processed and clean amplitude envelopes. The proposed predictor performs well (ρ > 0.95) in predicting the intelligibility of speech signals contaminated by additive noise and potentially non-linearly processed using time-frequency weighting.
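
The predictor's core quantity can be sketched as the correlation between clean and degraded band amplitude envelopes. The band layout, STFT settings, and use of one long-term correlation per band below are simplifications, not the paper's exact model.

```python
import numpy as np
from scipy.signal import stft

def envelope_correlation_score(clean, noisy, fs, nband=15):
    """Toy intelligibility index: mean correlation between clean and
    noisy/processed band amplitude envelopes."""
    f, _, C = stft(clean, fs, nperseg=256)
    _, _, N = stft(noisy, fs, nperseg=256)
    edges = np.linspace(0, len(f), nband + 1, dtype=int)   # crude band edges
    rhos = []
    for b in range(nband):
        ec = np.abs(C[edges[b]:edges[b + 1]]).sum(axis=0)  # band envelopes
        en = np.abs(N[edges[b]:edges[b + 1]]).sum(axis=0)
        if ec.std() > 0 and en.std() > 0:
            rhos.append(np.corrcoef(ec, en)[0, 1])
    return float(np.mean(rhos))
```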

#9 Frequency-adaptive post-filtering for intelligibility enhancement of narrowband telephone speech

Authors: Emma Jokinen ; Marko Takanen ; Paavo Alku

Post-filtering methods are used in mobile communications to improve the quality and intelligibility of speech. This paper introduces a frequency-adaptive post-filtering algorithm that selects from a predefined set of filters the one that reallocates the largest amount of speech energy from low to high frequencies. The proposed method and another post-filtering technique were compared to unprocessed speech in subjective listening tests in terms of intelligibility. The results indicate that the proposed method outperforms the reference method in difficult noise conditions.
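
The selection rule can be pictured as follows: filter the frame with every candidate and keep the candidate that maximizes the high-to-low band energy ratio. A minimal sketch; the 2 kHz band split and the candidate filter set are assumptions, not the paper's filter design.

```python
import numpy as np
from scipy.signal import lfilter

def select_postfilter(frame, filters, fs, split_hz=2000):
    """Pick, from a predefined set of (b, a) filters, the one whose output
    carries the most energy above split_hz relative to below it."""
    k = int(split_hz * len(frame) / fs)            # FFT bin of the band split
    best, best_ratio = None, -np.inf
    for b, a in filters:
        Y = np.abs(np.fft.rfft(lfilter(b, a, frame))) ** 2
        ratio = Y[k:].sum() / (Y[:k].sum() + 1e-12)
        if ratio > best_ratio:
            best, best_ratio = (b, a), ratio
    return best
```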

#10 Comparative investigation of objective speech intelligibility prediction measures for noise-reduced signals in Mandarin and Japanese

Authors: Junfeng Li ; Fei Chen ; Masato Akagi ; Yonghong Yan

In this paper, eight state-of-the-art objective speech intelligibility prediction measures are comparatively investigated for noisy signals before and after noise-reduction processing, in both Mandarin and Japanese. Clean speech signals (Chinese and Japanese words) were first corrupted by three types of noise at two signal-to-noise ratios and then presented to normal-hearing listeners for recognition; the measured intelligibility was subsequently predicted by the objective measures. The objective measures were further examined, in terms of correlation analysis and prediction errors between subjective scores and objective predictions, for noise-reduced signals and for noisy signals before and after noise-reduction processing. Results showed that the majority of objective measures behave differently for Mandarin and Japanese in predicting the subjective ratings, and that the STOI measure was consistently the best at predicting the effect of noise-reduction processing on speech intelligibility for both languages.

#11 Monitoring the effects of temporal clipping on VoIP speech quality

Authors: Andrew Hines ; Jan Skoglund ; Anil Kokaram ; Naomi Harte

This paper presents work on a real-time temporal clipping monitoring tool for VoIP. Temporal clipping can occur as a result of voice activity detection (VAD) or echo cancellation, where comfort noise is used in place of clipped speech segments. The algorithm presented will form part of a no-reference objective model for quantifying perceived speech quality in VoIP. The overall approach uses a modular design that will help pinpoint the reasons for degradations in addition to quantifying their impact on speech quality. The new algorithm was tested for VAD over a range of thresholds and varied speech frame sizes. The results are compared to objective Mean Opinion Scores (MOS-LQO) from POLQA and show that the proposed algorithm can efficiently predict temporal clipping in speech and correlates well with the full-reference quality predictions from POLQA. The model shows good potential for use in a real-time monitoring tool.

#12 The spectral dynamics of vowels in Mandarin Chinese

Author: Jiahong Yuan

This study investigated the dynamic spectral patterns of vowels in Mandarin Chinese using a corpus of monosyllabic words spoken in isolation. Mel-frequency cepstral coefficients (MFCCs) were parameterized in different ways to test the nature of the dynamic information in vowels through automatic vowel classification. Compared to the MFCCs extracted at the vowel midpoint, using the MFCCs extracted at two or three points (vowel onset, offset, and midpoint) greatly improved classification accuracies. Legendre polynomials fitted to the MFCCs over the entire vowel duration achieved approximately 30% relative error reductions over the three-point model. Euclidean cepstral distance was employed to measure the magnitude of spectral change. A negative correlation was found between the rate of spectral change and vowel duration. Vowel-dependent spectral changes appear primarily in the first half of a vowel. There is great diversity among the diphthongs and a considerable overlap between the diphthongs and the monophthongs in terms of the spectral dynamics.
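
The duration-normalized dynamic parameterization can be sketched as fitting Legendre polynomials to each MFCC trajectory over a time axis mapped to [-1, 1]; the fit order below is an assumption.

```python
import numpy as np
from numpy.polynomial import legendre

def legendre_mfcc_features(mfcc, order=3):
    """Fit Legendre polynomials to every MFCC trajectory over the vowel;
    the coefficients act as duration-normalized dynamic features.
    mfcc: array of shape (frames, coefficients)."""
    t = np.linspace(-1.0, 1.0, mfcc.shape[0])
    return legendre.legfit(t, mfcc, order)   # shape (order + 1, coefficients)
```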

#13 Pitch-gesture modeling using subband autocorrelation change detection

Authors: Malcolm Slaney ; Elizabeth Shriberg ; Jui-Ting Huang

Calculating speaker pitch (or F0) is typically the first computational step in modeling tone and intonation for spoken language understanding. Usually pitch is treated as a fixed, single-valued quantity. The inherent ambiguity in judging the octave of pitch, as well as spurious values, leads to errors in modeling pitch gestures that propagate through a computational pipeline. We present an alternative that instead measures changes in the harmonic structure using a subband autocorrelation change detector (SACD). This approach builds upon new machine-learning ideas for integrating autocorrelation information across subbands. Importantly, however, for modeling gestures we preserve multiple hypotheses and integrate information from all harmonics over time. The benefits of SACD over standard pitch approaches include robustness to noise and to the amount of voicing. This is important for real-world data in terms of both acoustic conditions and speaking style. We discuss applications in tone and intonation modeling, and demonstrate the efficacy of the approach in a Mandarin Chinese tone-classification experiment. Results suggest that SACD could replace conventional pitch-based methods for modeling gestures in selected spoken-language processing tasks.
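
A schematic reading of the change detection: per frame and subband, compute a normalized autocorrelation, then measure the distance between adjacent frames' autocorrelations, summed over subbands. The frame sizes and distance metric are assumptions, and the learned subband integration described in the paper is not reproduced.

```python
import numpy as np

def subband_autocorr_change(subbands, frame=400, hop=160, maxlag=250):
    """SACD-style change signal. subbands: array (n_bands, n_samples),
    e.g. the outputs of a filterbank applied to the speech signal."""
    def norm_ac(seg):
        ac = np.correlate(seg, seg, mode='full')[len(seg) - 1:][:maxlag]
        return ac / (ac[0] + 1e-12)
    n_frames = (subbands.shape[1] - frame) // hop
    prev, change = None, []
    for i in range(n_frames):
        acs = np.stack([norm_ac(b[i * hop:i * hop + frame]) for b in subbands])
        if prev is not None:
            change.append(np.linalg.norm(acs - prev, axis=1).sum())
        prev = acs
    return np.array(change)   # large values = change in harmonic structure
```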

#14 Analysis of emotional speech at subsegmental level

Authors: P. Gangamohan ; Sudarsana Reddy Kadiri ; B. Yegnanarayana

Emotional speech is produced when a speaker is in a state different from the normal state. The objective of this study is to explore the deviations in the excitation source features of emotional speech compared to normal speech. The features used for analysis are extracted at the subsegmental level (1.3 ms) of speech. A comparative study of these features across different emotions indicates that there are significant deviations in the subsegmental-level features of speech in an emotional state compared to the normal state.

#15 Periodicity extraction for voiced sounds with multiple periodicity

Authors: Masanori Morise ; Hideki Kawahara ; Kenji Ozawa

A periodicity extraction method is introduced to analyze voiced sounds with complex excitation behavior. Although a voiced sound generally has only one periodicity, some voiced sounds, such as pathological and singing voices, often have multiple periodicities. This article proposes a method for estimating multiple periodicities from voiced sounds to deal with these kinds of voices. First, a definition of multiple periodicity and its causes are given, and then the principle of the proposed method is introduced. The proposed method was evaluated using several artificial signals and pathological voices recorded in a real environment. The analysis results for the artificial signals indicated that the proposed method can extract multiple periodicities, and the results for the pathological voices show a similar tendency. These results suggest that the proposed method is effective at extracting multiple periodicities.

#16 Modelling and estimation of the fundamental frequency of speech using a hidden Markov model

Authors: John H. Taylor ; Ben Milner

This paper proposes using a hidden Markov model (HMM) to model a speech signal in terms of its speech class (voiced, unvoiced and nonspeech) and, for voiced speech, its fundamental frequency. States of the HMM represent unvoiced speech and nonspeech, together with multiple voiced states that model different fundamental frequencies. The transition matrix of the HMM models temporal changes in speech class and the time-varying fundamental frequency contour. The model is applied to voicing and fundamental frequency estimation by extracting acoustic features from a speech signal and then applying Viterbi decoding. Experimental results are presented that investigate the estimation accuracy of the proposed system, and a comparison is made against conventional methods.
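
Decoding in such a model is standard Viterbi over the joint class/F0 state space. A minimal sketch, assuming the state layout (nonspeech, unvoiced, one voiced state per F0 bin) and precomputed log-likelihoods:

```python
import numpy as np

def viterbi(log_A, log_B, log_pi):
    """log_A: (S, S) transition log-probs, log_B: (T, S) frame
    log-likelihoods, log_pi: (S,) initial log-probs. Returns the best
    state path; each state maps to a speech class and, if voiced, an
    F0 bin."""
    T, S = log_B.shape
    delta = log_pi + log_B[0]
    psi = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A       # best predecessor per state
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_B[t]
    path = np.zeros(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 2, -1, -1):            # backtrack
        path[t] = psi[t + 1, path[t + 1]]
    return path
```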

#17 Extended weighted linear prediction using the autocorrelation snapshot — a robust speech analysis method and its application to recognition of vocal emotions

Authors: Jouni Pohjalainen ; Paavo Alku

Temporally weighted linear predictive methods have recently been successfully used for robust feature extraction in speech and speaker recognition. This paper introduces their general formulation, where various efficient temporal weighting functions can be included in the optimization of the all-pole coefficients of a linear predictive model. Temporal weighting is imposed by multiplying elements of instantaneous autocorrelation "snapshot" matrices computed from speech data. With this novel autocorrelation-snapshot formulation of weighted linear prediction, it is demonstrated that different temporal aspects of speech can be emphasized in order to enhance robustness of feature extraction in speech emotion recognition.

#18 Improving the accuracy and the robustness of harmonic model for pitch estimation

Authors: Meysam Asgari ; Izhak Shafran

Accurate and robust estimation of pitch plays a central role in speech processing. Various methods in the time, frequency and cepstral domains have been proposed for generating pitch candidates. Most algorithms excel when the background noise is minimal, or for specific types of background noise. In this work, our aim is to improve the robustness and accuracy of pitch estimation across a wide variety of background noise conditions. For this we adopt the harmonic model of speech, a model that has gained considerable attention recently, and address two of its major weaknesses: the problem of pitch halving and doubling, and the need to specify the number of harmonics. We exploit the energy in neighboring frequencies to alleviate halving and doubling, and choose the optimal number of harmonics using a model complexity term with a BIC criterion. We evaluated the proposed pitch estimation method against other state-of-the-art techniques on the Keele data set in terms of gross pitch error and fine pitch error. Through extensive experiments on several noisy conditions, we demonstrate that the proposed improvements provide substantial gains over other popular methods under different noise levels and environments.
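
The model-order issue can be sketched as a least-squares harmonic fit whose number of harmonics is chosen by BIC. The likelihood form and the neighboring-frequency energy device for halving/doubling are not reproduced; the BIC variant below is a generic assumption.

```python
import numpy as np

def harmonic_fit_bic(x, fs, f0, max_h=20):
    """For a candidate f0, fit sums of harmonic sinusoids by least squares
    and pick the number of harmonics H minimizing BIC."""
    n = np.arange(len(x)) / fs
    best_bic, best_h = np.inf, 0
    for H in range(1, max_h + 1):
        if H * f0 >= fs / 2:                  # keep harmonics below Nyquist
            break
        cols = [np.ones_like(n)]
        for h in range(1, H + 1):
            cols += [np.cos(2 * np.pi * h * f0 * n),
                     np.sin(2 * np.pi * h * f0 * n)]
        A = np.stack(cols, axis=1)
        coef, *_ = np.linalg.lstsq(A, x, rcond=None)
        rss = np.sum((x - A @ coef) ** 2)
        bic = len(x) * np.log(rss / len(x) + 1e-300) + A.shape[1] * np.log(len(x))
        if bic < best_bic:
            best_bic, best_h = bic, H
    return best_h
```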

#19 A comparative study of glottal open quotient estimation techniques

Authors: John Kane ; Stefan Scherer ; Louis-Philippe Morency ; Christer Gobl

The robust and efficient extraction of features related to the glottal excitation source has become increasingly important for speech technology. The glottal open quotient (OQ) is one relevant measurement, known to vary significantly with changes in voice quality along a breathy-to-tense continuum. The extraction of OQ in the time domain, however, is hampered by the difficulty of consistently locating the point of glottal opening, as well as by the computational load of its measurement. Determining OQ correlates in the frequency domain is an attractive alternative; however, the lower frequencies of the glottal source spectrum are also affected by other aspects of the glottal pulse shape, precluding closed-form solutions and straightforward mappings. The present study compares three OQ estimation methods and shows that a new method based on spectral features and artificial neural networks outperforms existing methods in terms of voice quality discrimination, lower error values on a large volume of speech data, and dramatically reduced computation time.

#20 Estimation of multiple-branch vocal tract models: the influence of prior assumptions

Authors: Christian H. Kasess ; Wolfgang Kreuzer

Branched-tube models can be used for modeling nasal speech such as nasal stops and nasalized vowels. Previously, it has been shown that the use of probabilistic prior information such as smoothness priors can reduce the within-speaker variability of the vocal tract estimates. This model, however, lacked a representation of paranasal cavities and thus a model with a more complex branching structure is desirable. This raises the question of what prior information is necessary for physically plausible parameter estimates. Here, a model with one maxillary sinus is estimated. The sinus is parameterized in terms of its resonance using radius and angle in the z-plane, and the coupling area ratio. The probabilistic scheme mentioned above is used to estimate nasal stops /m/ and /n/ extracted from the TIMIT database. Different prior assumptions concerning resonance frequency, bandwidth, and coupling of the sinus to the nasal cavity are tested. Results show, on average, a better model fit for the model including the sinus. Further, prior assumptions are shown to have a large influence on the estimated resonance of the sinus. In particular, the lack of anatomically motivated assumptions about the bandwidth and/or the resonance frequency yields unrealistic estimates of these values.
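
The radius/angle parameterization of a resonance is the standard z-plane mapping: the pole angle follows from the center frequency and the radius from the bandwidth. A minimal sketch (the example values are illustrative, not estimates from the paper):

```python
import numpy as np

def resonance_to_pole(freq_hz, bw_hz, fs):
    """Map resonance frequency and bandwidth to z-plane pole radius and
    angle: r = exp(-pi * B / fs), theta = 2 * pi * f / fs."""
    return np.exp(-np.pi * bw_hz / fs), 2 * np.pi * freq_hz / fs

# e.g. a sinus resonance near 500 Hz with a 100 Hz bandwidth at fs = 16 kHz
r, theta = resonance_to_pole(500.0, 100.0, 16000)
```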

#21 Detecting overlapping speech with long short-term memory recurrent neural networks

Authors: Jürgen T. Geiger ; Florian Eyben ; Björn Schuller ; Gerhard Rigoll

Detecting segments of overlapping speech (when two or more speakers are active at the same time) is a challenging problem. Previously, mostly HMM-based systems have been used for overlap detection, employing various audio features. In this work, we propose a novel overlap detection system using Long Short-Term Memory (LSTM) recurrent neural networks. LSTMs are used to generate framewise overlap predictions, which are then used for overlap detection. Furthermore, a tandem HMM-LSTM system is obtained by adding the LSTM predictions to the HMM feature set. Experiments with the AMI corpus show that the overlap detection performance of LSTMs is comparable to that of HMMs, and that the combination of HMMs and LSTMs improves overlap detection by achieving higher recall.
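
A minimal framewise LSTM classifier of the kind described, sketched in PyTorch; the feature dimension, hidden size, and single-layer topology are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class OverlapLSTM(nn.Module):
    """Framewise binary overlap detector: an LSTM over acoustic feature
    frames with one sigmoid output per frame."""
    def __init__(self, n_feat=40, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_feat, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, x):                     # x: (batch, frames, n_feat)
        h, _ = self.lstm(x)
        return torch.sigmoid(self.out(h)).squeeze(-1)   # (batch, frames)

# Framewise probabilities can be thresholded for detection, or appended
# to the HMM feature vector to build the tandem system.
probs = OverlapLSTM()(torch.randn(2, 100, 40))
```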

#22 Evaluation of fundamental validity in applying AR-HMM with automatic topology generation to pathology voice analysis

Author: Akira Sasou

Voice-pathology detection from a subject's voice is a promising technology for the pre-diagnosis of larynx diseases, and glottal source estimation in particular plays a very important role in voice-pathology analysis. To more accurately estimate the spectral envelope and glottal source of pathological voices, we propose a method that automatically generates the topology of the Glottal Source Hidden Markov Model (GS-HMM) and estimates the Auto-Regressive (AR)-HMM parameters by combining the AR-HMM parameter estimation method with the Minimum Description Length-based Successive State Splitting (MDL-SSS) algorithm. This paper evaluates the fundamental validity of pathology-voice analysis based on the proposed method. The experimental results confirm its feasibility and fundamental validity.

#23 Significance of instants of significant excitation for source modeling

Authors: Nagaraj Adiga ; S. R. M. Prasanna

The objective of this work is to demonstrate the significance of instants of significant excitation for source modeling. Instants of significant excitation correspond to glottal closure, glottal opening, onset of burst, frication, and a small number of excitation instants around them. The speech signal is processed independently by zero frequency filtering (ZFF) to obtain epochs. The epochs are used as anchor points for extracting the instants of significant excitation from different representations of speech: the sequence of strength-weighted epochs, and small ranges of samples around the epochs taken from the linear prediction (LP) residual, the Hilbert envelope (HE) of the LP residual, and the cosine-of-phase sequence. The strength-weighted epoch sequence generates speech which is intelligible but synthetic in nature; considering a small region of instants of significant excitation around the epochs increases the naturalness of the synthesized speech significantly.
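
ZFF itself is compact: difference the signal, pass it through a cascade of ideal zero-frequency resonators (integrators), and remove the resulting polynomial trend by repeated local-mean subtraction; zero crossings of the output mark the epochs. A common-form sketch, in which the window length and the number of trend-removal passes are assumptions:

```python
import numpy as np

def zff_epochs(x, fs, win_ms=10.0):
    """Zero frequency filtering for epoch extraction. win_ms should be
    roughly 1-2 average pitch periods; 10 ms is an assumed default."""
    y = np.diff(x, prepend=x[:1])      # difference to remove DC offset
    for _ in range(4):                 # two zero-frequency resonators
        y = np.cumsum(y)               # (each contributes a double pole at z=1)
    w = int(win_ms * 1e-3 * fs) | 1    # odd-length mean-removal window
    for _ in range(3):                 # iterated trend removal
        y = y - np.convolve(y, np.ones(w) / w, mode='same')
    return np.where((y[:-1] < 0) & (y[1:] >= 0))[0]   # positive zero crossings
```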

#24 Significance of variable height-bandwidth group delay filters in the spectral reconstruction of speech

Authors: Devanshu Arya ; Anant Raj ; Rajesh M. Hegde

The significance of varying the height and bandwidth of the group delay spectrum is hitherto unexplored in the spectral reconstruction of speech signals. In this paper, a family of variable height-bandwidth filters is designed to evaluate its performance in the reconstruction of speech. The design procedure for higher-order group delay filters as a cascade of second-order filters is first described. These higher-order filters enable the synthesis of speech sounds by simultaneously varying the height and bandwidth of the group delay spectrum. The group delay filter response is corrected by removing zeros in close proximity to the unit circle, which give rise to abrupt phase transitions at points of significant excitation. Experiments on the spectral reconstruction and perception of speech using variable height-bandwidth group delay filters are conducted to compute the optimal height and bandwidth of the group delay filter. The experimental results indicate that the optimal height-bandwidth obtained from a family of variable height-bandwidth group delay filters does indeed improve the spectral reconstruction and perception of speech sounds compared to fixed height-bandwidth group delay filters.
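
Since group delay is additive over a cascade, the response of such a higher-order filter can be sketched as the sum of the second-order sections' group delays; varying each section's pole radius and angle varies the height and bandwidth of its peak. The pole values below are illustrative.

```python
import numpy as np
from scipy.signal import group_delay

def resonator(r, theta):
    """Second-order all-pole section with poles at radius r, angle theta."""
    return [1.0], [1.0, -2 * r * np.cos(theta), r * r]

def cascade_group_delay(sections, n_freqs=512):
    """Group delay of a cascade = sum of the sections' group delays."""
    total = np.zeros(n_freqs)
    for b, a in sections:
        w, gd = group_delay((b, a), w=n_freqs)
        total += gd
    return w, total

w, gd = cascade_group_delay([resonator(0.95, 0.3), resonator(0.9, 1.2)])
```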

#25 Nonlinear prediction of speech signal using Volterra-Wiener series

Authors: Hemant A. Patil ; Tanvina B. Patel

Linear Prediction (LP) analysis has proven very effective and successful in speech analysis and speech synthesis applications. This may be because LP analysis implicitly captures the time-varying vocal tract area function. However, it captures only second-order statistics and only the linear dependencies in the sequence of speech samples (and not the higher-order relations), as a result of which the LP residual remains intelligible. This paper studies the effectiveness of nonlinear prediction (NLP) of the speech signal using the Volterra-Wiener series and uses a novel chaotic titration method to analyze the chaotic characteristics of the residual obtained by both the LP and NLP methods. The experimental results demonstrate that the proposed NLP approach gives a lower prediction error, a relatively flat residual spectrum, a lower PESQ score (i.e., to some extent an objective evaluation of MOS) and lower chaoticity than its LP counterpart. Finally, the L1 and L2 norms of the NLP residual were found to be smaller than those of the LP residual for five instances of voiced and unvoiced regions extracted from speakers of the TIMIT database.
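
A truncated second-order Volterra predictor can be fit by ordinary least squares over past samples and their pairwise products. The memory length and order below are assumptions; the Wiener orthogonalization and the chaotic titration analysis are not reproduced.

```python
import numpy as np

def volterra2_predict(x, p=4):
    """Second-order Volterra predictor: linear terms in the p past samples
    plus all their pairwise products, fit by least squares."""
    rows, targets = [], []
    for n in range(p, len(x)):
        past = x[n - p:n][::-1]                       # x[n-1] ... x[n-p]
        quad = np.outer(past, past)[np.triu_indices(p)]
        rows.append(np.concatenate(([1.0], past, quad)))
        targets.append(x[n])
    H, y = np.stack(rows), np.array(targets)
    coef, *_ = np.linalg.lstsq(H, y, rcond=None)
    return coef, y - H @ coef                         # coefficients, NLP residual
```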